1 Introduction

This note gives a user-oriented view of Information Extraction (IE). No
knowledge of language processing is assumed. For a more technical overview
see [CL96].



Information Extraction is a process which takes unseen texts as input and
produces fixed-format, unambiguous data as output. This data may be used
directly for display to users, or may be stored in a database or spreadsheet for
later analysis, or may be used for indexing purposes in Information Retrieval
(IR) applications.

It is instructive to compare IE and IR: whereas IR simply finds texts and 
presents them to the user, the typical IE application analyses texts and 
presents only the specific information from them that the user is interested 
in. For example, a user of an IR system wanting information on the share 
price movements of companies with holdings in Bolivian raw materials would 
typically type in a list of relevant words and receive in return a set of doc- 
uments (e.g. newspaper articles) which contain likely matches. The user 
would then read the documents and extract the requisite information them- 
selves. They might then enter the information in a spreadsheet and produce 
a chart for a report or presentation. In contrast, an IE system user could, 
with a properly configured application, automatically populate their spread- 
sheet directly with the names of companies and the price movements. 

There are advantages and disadvantages to IE with respect to IR. IE sys- 
tems are more difficult and knowledge-intensive to build, and are to varying 
degrees tied to particular domains and scenarios (see next section). They are
also (for most tasks) less accurate than human readers. IE is more compu- 
tationally intensive than IR. However, in applications where there are large 
text volumes IE is potentially much more efficient than IR because of the 
possibility of reducing the amount of time analysts spend reading texts. Also, 
where results need to be presented in several languages, the fixed format, 
unambiguous nature of IE results makes this straightforward in comparison 
with providing full translation facilities. 



2 Types of IE 

There are four types of information extraction (or information extraction
tasks) currently available, as defined by the leading forum for this research,
the Message Understanding Conferences [GS96].



Named Entity recognition (NE) 

Finds and classifies names, places etc. 

Coreference Resolution (CO) 

Identifies identity relations between entities in texts. 

Template Element construction (TE) 

Adds descriptive information to NE results. 

Scenario Template production (ST) 

Fits TE results into specified event scenarios. 

From a user's point of view, NE, TE and ST are the most relevant IE tasks
(CO, as noted below, is necessary as an adjunct to the other tasks, but is 
of limited direct usefulness to the IE system user). NE, TE and ST provide 
progressively higher-level information about texts. 

These are described in more detail below, after a discussion of the current 
performance levels of IE technology. 



3 Performance levels 

Each of the four types of IE has been the subject of rigorous performance
evaluation in MUC-6 (1995) and other MUCs, so it is possible to say quite
precisely how well the current level of technology performs. Below we will
quote percentage figures quantifying performance levels - they should be
interpreted as a combined measure of precision and recall (see the section
on evaluation in [Adv95]). Several caveats should be noted: most of the
evaluation has been on English (with some Japanese, Chinese and Spanish)
- some applications of the technology may be either easier or more difficult
in other languages.
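
As a rough illustration of what a combined precision and recall figure
means, the sketch below computes precision, recall and their weighted
harmonic mean (commonly called the F-measure) from counts of correct,
spurious and missed items. The function and parameter names are our own,
and the equal weighting is an assumption; the scoring actually used in MUC
is described in [Adv95], so this should be read as a simplification rather
than the official scorer.

```python
def precision_recall_f(correct: int, spurious: int, missed: int, beta: float = 1.0):
    """Return (precision, recall, F) for one system run.

    correct  - items the system produced that match the answer key
    spurious - items the system produced that are not in the key
    missed   - key items the system failed to produce
    beta     - relative weight of recall against precision (1.0 = equal)
    """
    produced = correct + spurious
    expected = correct + missed
    precision = correct / produced if produced else 0.0
    recall = correct / expected if expected else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    p, r, f = precision_recall_f(correct=80, spurious=20, missed=40)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

A system that produces few items but gets them right scores high on
precision and low on recall; the combined figure penalises either kind of
imbalance.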

The performance of each IE task, and the ease with which it may be devel- 
oped, is to varying degrees dependent on: 

Text type: the kinds of texts we are working with, for example Wall Street 
Journal articles, or email messages, or HTML documents from the 
World Wide Web. 

Domain: the broad subject matter of those texts, e.g. financial news, or 
requests for technical support, or tourist information. 

Scenario: the particular event types that the IE user is interested in, for 
example mergers between companies, or problems experienced with a 
particular software package, or descriptions of how to locate parts of a 
city. 

For example, a particular IE application might be configured to process fi- 
nancial news articles from a particular news provider and find information 
about mergers between companies and various other scenarios. The per- 
formance of the application would be predictable for only this conjunction 
of factors. If it was later required to extract facts from the love letters of 
Napoleon Bonaparte as published on wall posters in the 1871 Paris Com- 
mune, performance levels would no longer be predictable. Tailoring an IE 
system to new requirements is a task that varies in scale dependent on the 
degree of variation in the three factors listed above. 

4 Named Entity recognition

The simplest and most reliable IE technology is Named Entity recognition
(NE). NE systems identify all the names of people, places, organisations,
dates, and amounts of money. So, for example, if we run the Wall Street
Journal text in figure 1 through an NE recogniser, the result is as in
figure 2 (this looks better in colour!). (The viewers shown here and below
are part of the GATE language engineering architecture and development
environment - see [CWG96].) NE recognition can be performed at 96%
accuracy; the current Sheffield system ([GWH+95]) performs at 92%
accuracy. Given that human annotators do not perform to the 100% level
(measured in MUC by inter-annotator comparisons), NE recognition can now
be said to function at human performance levels, and applications of the
technology are increasing rapidly as a result.

    <DOC>
    <DOCID> wsj94_008.0212 </DOCID>
    <DOCNO> 940413-0062. </DOCNO>
    <HL> Who's News:
    @ Burns Fry Ltd. </HL>
    <DD> 04/13/94 </DD>
    <SO> WALL STREET JOURNAL (J), PAGE B10 </SO>
    <CO> MER </CO>
    <IN> SECURITIES (SCR) </IN>
    <TXT>
    <P>
    BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was
    named executive vice president and director of fixed income at this
    brokerage firm. Mr. Wright resigned as president of Merrill Lynch
    Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark
    Kassirer, 48, who left Burns Fry last month. A Merrill Lynch
    spokeswoman said it hasn't named a successor to Mr. Wright, who is
    expected to begin his new position by the end of the month.
    </P>
    </TXT>
    </DOC>

Figure 1: An example text

Figure 2: Named entity recognition (GATE viewer screenshot of the figure 1
text with the recognised named entities highlighted; a colour key
distinguishes entity types such as organization)

A recent evaluation of NE for Spanish, Japanese and Chinese ([MOC96])
produced the following scores:

    language    best system
    Spanish     93.04 %
    Japanese    92.12 %
    Chinese     84.51 %

The process is weakly domain dependent, i.e. changing the subject matter of 
the texts being processed from financial news to other types of news would 
involve some changes to the system, and changing from news to scientific 
papers would involve quite large changes. 
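
To make the shape of NE output concrete, here is a deliberately naive
sketch that marks up typed spans using a hand-written name list and one
date pattern. The name list, the type labels and the function name are
assumptions made for illustration; the systems evaluated in MUC use much
richer resources and contextual analysis, and nothing here reflects how
the Sheffield system is implemented.

```python
import re
from typing import List, Tuple

# Hypothetical, hand-written name lists taken from the example text.
GAZETTEER = {
    "organization": ["Merrill Lynch Canada Inc.", "Merrill Lynch & Co.", "Burns Fry"],
    "person": ["Donald Wright", "Mark Kassirer"],
    "location": ["Toronto"],
}
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")  # e.g. 04/13/94


def recognise(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, type, surface string) spans, sorted by position."""
    spans = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                spans.append((match.start(), match.end(), entity_type, match.group()))
    for match in DATE_PATTERN.finditer(text):
        spans.append((match.start(), match.end(), "date", match.group()))
    return sorted(spans)


if __name__ == "__main__":
    sample = ("Mr. Wright resigned as president of Merrill Lynch Canada Inc., "
              "a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, "
              "who left Burns Fry last month.")
    for start, end, entity_type, surface in recognise(sample):
        print(f"{start:3d}-{end:3d}  {entity_type:12s}  {surface}")
```

Note that the bare "Wright" in the sample is not found: a list lookup
cannot relate it to "Donald Wright", which is one reason name lists alone
fall short of the accuracy figures quoted above.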

5 Coreference resolution

Coreference resolution (CO) involves identifying identity relations between 
entities in texts. These entities are both those identified by NE recognition 
and anaphoric references to those entities. For example, in 

Alas, poor Yorick, I knew him well. 

coreference resolution would tie "Yorick" with "him" (and "I" with Hamlet, 
if that information was present in the surrounding text). 

This process is less relevant to users than other IE tasks (i.e. whereas the 
other tasks produce output that is of obvious utility for the application user, 
this task is more relevant to the needs of the application developer). For 
text browsing purposes we might use CO to highlight all occurrences of 
the same object or provide hypertext links between them. CO technology 
might also be used to make links between documents, though this is not 
currently part of the MUC programme. The main significance of this task, 
however, is as a building block for TE and ST (see below). CO enables 
the association of descriptive information scattered across texts with the 
entities to which it refers. To continue the hackneyed Shakespeare example, 
coreference resolution might allow us to situate Yorick in Denmark. Figure
3 shows results for our example text.
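
The building-block role of CO can be sketched as follows: once mentions
are grouped into chains, descriptive information found next to any one
mention can be attached to the entity as a whole. The mention ids, the
chains and the slot names below are hypothetical and hand-written; only
the facts themselves come from the example text.

```python
# Mentions from the example text, keyed by a made-up mention id.
MENTIONS = {
    "m1": "Donald Wright",
    "m2": "Mr. Wright",
    "m3": "his",
    "m4": "Mark Kassirer",
}

# Hypothetical CO output: each chain lists mentions of the same entity.
CHAINS = [["m1", "m2", "m3"], ["m4"]]

# Descriptive facts found next to individual mentions.
FACTS = {
    "m1": {"age": "46"},
    "m2": {"previous_post": "president of Merrill Lynch Canada Inc."},
    "m4": {"age": "48"},
}


def merge_facts(chains, facts):
    """Gather the facts of every mention in a chain under the chain's first mention."""
    merged = {}
    for chain in chains:
        combined = {}
        for mention_id in chain:
            combined.update(facts.get(mention_id, {}))
        merged[chain[0]] = combined
    return merged


if __name__ == "__main__":
    for head, slots in merge_facts(CHAINS, FACTS).items():
        print(MENTIONS[head], slots)
```

Without the chain linking "Mr. Wright" to "Donald Wright", the age and the
previous post would be recorded against two apparently different people.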

CO resolution is an imprecise process when applied to the solution of anaphoric
reference. The Sheffield system scored 51% recall and 71% precision (see
footnote 1) at MUC-6; other systems scored e.g. 59% recall / 72% precision,
63% recall / 63% precision. These scores are low (although problems with
completing the task definition on schedule complicated matters, and led to
human scores of only around 80%), but note that this hides the difference
between proper noun coreference identification (same object, different
spelling or compounding, e.g. "IBM", "IBM Europe", "International Business
Machines Ltd.", ...) and anaphora resolution, the former being a
significantly easier problem.

Footnote 1: For statistical reasons the combined precision and recall
measure we use elsewhere is inappropriate here.

Figure 3: Coreference resolution (GATE viewer screenshot of the figure 1
text; the colour key distinguishes co-referred items from the selected
reference chain)



CO systems are domain dependent. 



6 Template Element production 



The TE task builds on NE recognition and coreference resolution. In
addition to locating and typing (i.e. classifying, or assigning to a type -
personal name, date etc.) entities in documents, TE associates descriptive
information with the entities. For example, from the figure 1 text the
system finds out that Burns Fry Ltd. is located in Toronto, and it adds the
information that this is in Canada.

Template elements for the figure 1 text are given in figure 4. The format
is a somewhat arbitrary one developed at the behest of the American
intelligence community (the original target user group of the MUC
competitions). It is difficult to read; the main point to note is that it is
essentially a database record, and could just as well be formatted for SQL
store operations, or reading into a spreadsheet, or (with some extra
processing) for multilingual presentation. Section 8 gives a simplified
example.

    <ORGANIZATION-9404130062-1> :=
        ORG_NAME:     "BURNS FRY Ltd."
        ORG_ALIAS:    "Burns Fry Ltd."
        ORG_TYPE:     COMPANY
        ORG_LOCALE:   Toronto CITY
        ORG_COUNTRY:  Canada
    <ORGANIZATION-9404130062-2> :=
        ORG_NAME:     "Merrill Lynch Canada Inc."
        ORG_ALIAS:    "Merrill Lynch & Co."
        ORG_TYPE:     COMPANY
    <PERSON-9404130062-1> :=
        PER_NAME:     "Mark Kassirer"
    <PERSON-9404130062-2> :=
        PER_NAME:     "Donald Wright"
        PER_ALIAS:    "Wright"
        PER_TITLE:    "Mr."

Figure 4: Template elements
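
The point that a template element is essentially a database record can be
illustrated with a short sketch that stores one organisation element in an
SQL table and also writes it out as a spreadsheet-readable CSV row. The
slot values are those shown for the first organisation in figure 4; the
table name, column names and function names are hypothetical, and SQLite
is used only because it needs no setup.

```python
import csv
import io
import sqlite3

# Slot values copied from figure 4; the keys are illustrative column names.
template_element = {
    "id": "ORGANIZATION-9404130062-1",
    "org_name": "BURNS FRY Ltd.",
    "org_alias": "Burns Fry Ltd.",
    "org_type": "COMPANY",
    "org_locale": "Toronto CITY",
    "org_country": "Canada",
}
COLUMNS = list(template_element)


def store_in_database(record, connection):
    """Insert one template element as a row of a hypothetical 'organization' table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS organization (%s)" % ", ".join(COLUMNS)
    )
    placeholders = ", ".join("?" for _ in COLUMNS)
    connection.execute(
        "INSERT INTO organization VALUES (%s)" % placeholders,
        [record[column] for column in COLUMNS],
    )


def to_csv(record):
    """Render the same record as a header line plus one CSV data line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_in_database(template_element, connection)
    print(connection.execute("SELECT org_name, org_country FROM organization").fetchall())
    print(to_csv(template_element))
```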



The current Sheffield system scores 71% for TE production; the best MUC-6 
system scored 80%. Humans achieved 93%. MUC-6 was the first MUC to 
evaluate TE and ST tasks separately - TE scores should improve in future 
as developers gain more experience with the task. 



As in NE recognition, the production of TEs is weakly domain dependent,
i.e. changing the subject matter of the texts being processed from financial 
news to other types of news would involve some changes to the system, and 
changing from news to scientific papers would involve quite large changes. 

7 Scenario Template extraction

Scenario templates (STs) are the prototypical outputs of IE systems. They
tie together TE entities into event and relation descriptions. For example,
TE may have identified Isabelle, Dominique and Françoise as people entities
present in the Robert edition of Napoleon's love letters. ST might then
identify facts such as that Isabelle moved to Paris in August 1802 from Lyon
to be nearer to the little chap, that Dominique then burnt down Isabelle's
apartment block and that Françoise ran off with one of Gerard Depardieu's
ancestors. A slightly more pertinent example is given in figure 5. The same
comments regarding format apply as for the TE task.

ST is a difficult IE task. The current Sheffield system scores 49% for ST 
production; the best MUC-6 system scored 56%. The human score was 81%, 
which illustrates the complexity involved. These figures should be taken into 
account when considering appropriate applications of ST technology. Note, 
however, that it is possible to increase precision at the expense of recall: 
we can develop ST systems that don't make many mistakes, but that miss 
quite a lot of occurrences of relevant scenarios. Alternatively we can push 
up recall and miss less, but at the expense of making more mistakes. 
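
The trade-off between precision and recall can be made concrete with a
small sketch. It assumes, purely for illustration, that the system attaches
a confidence score to each scenario it proposes and that we only report
proposals above a threshold; the scores, the answer-key judgements and the
names below are invented and do not come from any MUC system.

```python
# Hypothetical system output: (confidence score, whether the proposed
# scenario is correct according to an answer key).
PROPOSALS = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
             (0.55, False), (0.40, True), (0.30, False)]
TOTAL_IN_KEY = 5  # scenarios in the answer key, one of which is never proposed


def score_at(threshold, proposals=PROPOSALS, total_in_key=TOTAL_IN_KEY):
    """Return (precision, recall) when only proposals >= threshold are reported."""
    kept = [is_correct for confidence, is_correct in proposals if confidence >= threshold]
    correct = sum(kept)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / total_in_key
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.85, 0.60, 0.35):
        precision, recall = score_at(threshold)
        print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold leaves only the proposals the system is most sure of,
so fewer mistakes are reported but more relevant scenarios are missed;
lowering it has the opposite effect.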

The ST task is both domain dependent, and, by definition, tied to the
scenarios of interest to the users. Note however that the results of NE and
TE feed into ST. Note also that in MUC-6 the developers were given the
specifications for the ST task only 1 month before the systems were scored.
This was because it was noted that an IE system that required very lengthy
revision to cope with new scenarios was of less worth than one that could
meet new specifications relatively rapidly. As a result of this, the scores for
ST in MUC-6 were probably slightly lower than they might have been with
a longer development period. Experience from previous MUCs suggests that
current technology has difficulty attaining scores much above 60% accuracy
for this task, however.

    <TEMPLATE-9404130062-1> :=
        DOC_NR:          "9404130062"
        CONTENT:         <SUCCESSION_EVENT-9404130062-11>
                         <SUCCESSION_EVENT-9404130062-20>
                         <SUCCESSION_EVENT-9404130062-30>
    <SUCCESSION_EVENT-9404130062-11> :=
        SUCCESSION_ORG:  <ORGANIZATION-9404130062-18>
        POST:            "executive vice president"
        IN_AND_OUT:      <IN_AND_OUT-9404130062-5>
        VACANCY_REASON:  OTH_UNK
    <IN_AND_OUT-9404130062-5> :=
        IO_PERSON:       <PERSON-9404130062-50>
        NEW_STATUS:      IN
        ON_THE_JOB:      UNCLEAR
    <SUCCESSION_EVENT-9404130062-20> :=
        SUCCESSION_ORG:  <ORGANIZATION-9404130062-28>
        POST:            "president"
        IN_AND_OUT:      <IN_AND_OUT-9404130062-15>
                         <IN_AND_OUT-9404130062-21>
                         <IN_AND_OUT-9404130062-22>
        VACANCY_REASON:  REASSIGNMENT
    <IN_AND_OUT-9404130062-15> :=
        IO_PERSON:       <PERSON-9404130062-50>
        NEW_STATUS:      OUT
        ON_THE_JOB:      NO
    <IN_AND_OUT-9404130062-21> :=
        IO_PERSON:       <PERSON-9404130062-50>
        NEW_STATUS:      IN
        ON_THE_JOB:      UNCLEAR
    <IN_AND_OUT-9404130062-22> :=
        IO_PERSON:       <PERSON-9404130062-29>
        NEW_STATUS:      OUT
        ON_THE_JOB:      UNCLEAR
    <SUCCESSION_EVENT-9404130062-30> :=
        SUCCESSION_ORG:  <ORGANIZATION-9404130062-28>
        POST:            "president"
        IN_AND_OUT:      <IN_AND_OUT-9404130062-31>
        VACANCY_REASON:  REASSIGNMENT
    <IN_AND_OUT-9404130062-31> :=
        IO_PERSON:       <PERSON-9404130062-29>
        NEW_STATUS:      OUT
        ON_THE_JOB:      NO
    <ORGANIZATION-9404130062-18> :=
        ORG_NAME:        "BURNS FRY Ltd."
        ORG_ALIAS:       "Burns Fry Ltd."
        ORG_TYPE:        COMPANY
        ORG_LOCALE:      Toronto CITY
        ORG_COUNTRY:     Canada

Figure 5: Scenario template

8 An Extended Example

So far we have discussed IE from a general perspective. In this section we 
look at the capabilities that might be delivered as part of an application 
designed to support analysts tracking international drug dealing. 

When the system is specified, our imaginary analyst states that "the op- 
erational domains that user interests are centred around are... drug en- 
forcement, money laundering, organised crime, terrorism, legislation". The 
entities of interest within these domains are cited as "person, company, 
bank, financial entity, transportation means, locality, place, organisation, 
time, telephone, narcotics, legislation, activity". A number of relations (or 
"links") are also specified, for example between people, between people and 
companies, etc. These relations are not typed, i.e. the kind of relation in- 
volved is not specified. Some relations take the form of properties of entities 
- e.g. the location of a company - whilst others denote events - e.g. a person 
visiting a ship. 

Working from this starting point an IE system is designed that: 

1. is tailored to texts dealing with drug enforcement, money laundering, 
organised crime, terrorism, and legislation; 

2. recognises entities in those texts and assigns them to one of a number of 
categories drawn from the set of entities of interest (person, company, ...);

3. associates certain types of descriptive information with these entities, 
e.g. the location of companies; 

4. identifies a set (relatively small to begin with) of events of interest by 
tying entities together into event relations. 

For example, consider the following text:

Reuter - New York, Wednesday 12 July 1996. 

New York police announced today the arrest of Frederick J. 
Thompson, head of Jay Street Imports Inc., on charges of drug 
smuggling. Thompson was taken from his Manhattan apartment 
in the early hours yesterday. His attorney, Robert Guliani, issued
a statement denying any involvement with narcotics on the
part of his client. "No way did Fred ever have dealings with 
dope", Guliani said. 

A Jay Street spokesperson said the company had ceased trading 
as of today. The company, a medium-sized import-export con- 
cern established in 1989, had been the main contractor in several 
collaborative transport ventures involving Latin-American pro- 
duce. Several associates of the firm moved yesterday to distance 
themselves from the scandal, including the mid-western trans- 
portation company Downing-Jones. 

Thompson is understood to be accused of importing heroin into 
the United States. 

From this IE might produce information such as the following (in some for- 
mat to be determined according to user requirements, e.g. SQL statements 
addressing some database schema). 

First, a list of entities and associated descriptive information. Relations of 
property type are made explicit. Each entity has an id, e.g. ENTITY-2, which 
can be used for cross-referencing between entities and for describing events 
involving entities. Each also has a type, or category, e.g. company, person. 
Additionally various type-specific information is available, e.g., for dates, a 
normalisation giving the date in standard format. 


Reuter
    id:             ENTITY-1
    type:           company
    business:       news

New York
    id:             ENTITY-2
    type:           location
    subtype:        city
    is_in:          US

Wednesday 12 July 1996
    id:             ENTITY-3
    type:           date
    normalisation:  12/07/1996

New York police
    id:             ENTITY-4
    type:           organisation
    location:       ENTITY-2

Frederick J. Thompson
    id:             ENTITY-5
    type:           person
    aliases:        Thompson; Fred
    domicile:       ENTITY-7
    profession:     managing director
    employer:       ENTITY-6

Jay Street Imports Inc.
    id:             ENTITY-6
    type:           organisation
    aliases:        Jay Street
    business:       import-export

Manhattan
    id:             ENTITY-7
    type:           location
    subtype:        city
    is_in:          ENTITY-2

Robert Guliani
    id:             ENTITY-8
    type:           person
    aliases:        Guliani

1989
    id:             ENTITY-9
    type:           date
    normalisation:  ?/?/1989

Latin-America
    id:             ENTITY-10
    type:           location
    subtype:        country

Downing-Jones
    id:             ENTITY-11
    type:           organisation
    business:       transportation

heroin
    id:             ENTITY-12
    type:           drug
    class:          A

United States
    id:             ENTITY-13
    type:           location
    subtype:        country


(These results correspond to the combination of NE and TE tasks; if we 
removed all but the type slots we would be left with the NE data.) 

Second, relations of event type, or scenarios: 


narcotics-smuggling
    id:             EVENT-1
    destination:    ENTITY-13
    source:         unknown
    perpetrators:   ENTITY-5, ENTITY-6
    status:         on-trial

joint-venture
    id:             EVENT-2
    type:           transport
    companies:      ENTITY-6, ENTITY-11
    status:         past


(These results correspond to the ST task.) 
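
A minimal sketch of how an application might hold these results in memory
is given below. It uses a handful of the entities and both events listed
above, shows that dropping everything but the type leaves the NE-level
data, and resolves the entity ids in an event back to names. The class and
function names are our own invention, and the event-specific slots
(perpetrators, companies) are flattened into a single participants list
for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    id: str
    name: str
    type: str
    slots: Dict[str, str] = field(default_factory=dict)  # descriptive (TE) information


@dataclass
class Event:
    id: str
    kind: str
    participants: List[str]  # entity ids, used for cross-referencing
    status: str


ENTITIES = {e.id: e for e in [
    Entity("ENTITY-5", "Frederick J. Thompson", "person",
           {"employer": "ENTITY-6", "domicile": "ENTITY-7"}),
    Entity("ENTITY-6", "Jay Street Imports Inc.", "organisation",
           {"business": "import-export"}),
    Entity("ENTITY-7", "Manhattan", "location",
           {"subtype": "city", "is_in": "ENTITY-2"}),
    Entity("ENTITY-11", "Downing-Jones", "organisation",
           {"business": "transportation"}),
]}

EVENTS = [
    Event("EVENT-1", "narcotics-smuggling", ["ENTITY-5", "ENTITY-6"], "on-trial"),
    Event("EVENT-2", "joint-venture", ["ENTITY-6", "ENTITY-11"], "past"),
]


def ne_view(entities):
    """Keep only names and types, i.e. the NE-level data."""
    return [(e.name, e.type) for e in entities.values()]


def describe(event, entities):
    """Resolve the entity ids in an event to readable names."""
    names = [entities[entity_id].name for entity_id in event.participants]
    return f"{event.id} {event.kind} ({event.status}): " + ", ".join(names)


if __name__ == "__main__":
    print(ne_view(ENTITIES))
    for event in EVENTS:
        print(describe(event, ENTITIES))
```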



9 Multilingual IE 



The results described above may then be translated for presentation to the 
user or for storage in existing databases. In general this task is much easier 
than translation of ordinary text, and is close to software localisation, the 
process of making a program's messages and labels on menus and buttons 
multilingual. Localisation involves storing lists of direct translations for 
known items. In our case these lists would store translations for words such
as "entity", "location", "date", "heroin". We also need ways to display dates
and numbers in local formats, but code libraries are available for this type 
of problem. 
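
A localisation-style presentation layer of this kind can be sketched in a
few lines. The translation lists, the locale codes and the function names
below are hypothetical, and the date patterns are hard-coded rather than
taken from a localisation library, so this is an illustration of the idea
and not a recommendation of a particular mechanism.

```python
from datetime import date

# Hypothetical translation lists for slot labels and fixed values.
TRANSLATIONS = {
    "fr": {"entity": "entité", "location": "lieu", "date": "date", "heroin": "héroïne"},
    "es": {"entity": "entidad", "location": "lugar", "date": "fecha", "heroin": "heroína"},
}

# Assumed display patterns per locale code.
DATE_FORMATS = {
    "en-GB": "%d/%m/%Y",
    "en-US": "%m/%d/%Y",
    "fr": "%d/%m/%Y",
    "es": "%d/%m/%Y",
}


def localise_term(term, language):
    """Look up a known item; fall back to the original term when untranslated."""
    return TRANSLATIONS.get(language, {}).get(term, term)


def localise_date(value, locale_code):
    """Format a date for display, defaulting to year-month-day order."""
    return value.strftime(DATE_FORMATS.get(locale_code, "%Y-%m-%d"))


if __name__ == "__main__":
    arrest_report_date = date(1996, 7, 12)
    print(localise_term("heroin", "fr"), localise_date(arrest_report_date, "fr"))
    print(localise_term("heroin", "es"), localise_date(arrest_report_date, "en-US"))
```

The lookup only works because the set of items is closed; free text, as
discussed next, has no such list to consult.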

Problems can arise where arbitrary pieces of text are used in the entity de- 
scription structures, for example the descriptor slot in MUC-6 TE objects. 
Here a noun phrase from the text is extracted, with whatever qualifiers, 
relative clauses etc. happen to be there, so the language is completely unre- 
stricted and would need a full translation mechanism. 